1  Introduction

In this report we suggest a general approach and a specific test statistic which may
help to detect the presence of ‘interesting’ multivariate structure in a large,
high-dimensional data set. Our goal is to supply a preliminary test which is fairly easy
to compute and which might predict whether it is worthwhile to carry out more
computationally intensive procedures such as projection pursuit or cluster analysis.
The basic idea underlying this approach is that a data set (or distribution) in
which the coordinates (covariates) are independent is ‘boring’. In such a data set,
the multivariate structure is entirely determined by the marginal distributions. For
example, if a data set with independent coordinates contains clusters, these clusters
can be detected by merely examining the marginal distributions. More generally,
a data set may be considered ‘boring’ if there is some simple transformation which
converts it into a data set with independent coordinates. For example, an appropriate
linear transformation will convert the multivariate normal (MVN) distribution
into a distribution with independent normal coordinates, and thus the MVN
distribution is ‘boring’. If a distribution is ‘boring’ in the sense we have outlined, the
multivariate structure is trivial and there is no point in carrying out procedures such
as projection pursuit or cluster analysis.

Our  approach,  put  briefly,  is  as  follows:  Given  a  data  set,  we  attempt  to  find  a 
simple  transformation  which  converts  it  into  a  new  data  set  in  which  the  coordinates 
are  (at  least  approximately)  independent.  After  transforming  the  data,  we  test  the 
null  hypothesis  of  independence  by  discretizing  each  coordinate  and  analyzing  the 
resulting  categorical  data  as  a  contingency  table.  We  can  compare  the  cell  counts 
in  this  contingency  table  with  those  expected  under  independence  and,  if  a  formal 
test statistic is desired, we can employ the usual chi-squared test of independence
for  contingency  tables.  If  we  resoundingly  reject  the  hypothesis  of  independence, 
then  the  data  set  has  ‘interesting’  multivariate  structure  and  more  computationally 
intensive  procedures  should  then  be  used  to  determine  the  form  of  this  structure. 
If  we  fail  to  reject  the  hypothesis  of  independence  (and  there  is  also  no  evidence 
of  structure  in  the  usual  bivariate  scatter  plots  of  the  data),  then  the  data  set  is 
probably  ‘boring’  and  further  exploration  with  expensive  techniques  might  not  be 
worthwhile. 

There may be very few high-dimensional multivariate data sets which are ‘boring’ in
the  sense  given  above.  However,  the  same  approach  can  also  be  used  to  examine 
residuals  obtained  after  model  fitting.  Typically,  if  the  model  is  correctly  chosen, 
one  expects  the  residuals  to  be  without  structure.  The  methods  we  present  may 
prove  useful  in  detecting  any  remaining  structure  in  the  residuals. 

Any preliminary examination of a data set should include studying the marginal
distributions and looking at numerous bivariate scatter plots. The procedure
described in this report does not in any way replace these elementary techniques.

2  Detailed  Description  of  the  Method 

Let X be an $n \times p$ data matrix with elements $x_{ij}$. The n rows of X are obtained
by random sampling from some p-variate population having a continuous joint
distribution.

A. Transform the Data: The data should be transformed to remove known, obvious
or suspected dependence/structure. (Our primary goal is to discover
the existence of surprising or unsuspected multivariate structure, hence we
typically try to remove known or obvious dependence such as that indicated
in the correlation matrix or in bivariate scatter plots.) There is an infinite
variety of transformations which can be used; we shall discuss a few of the
possibilities later. In the remainder of this description, the transformed data
will be denoted by Z, an $n \times p$ matrix with elements $z_{ij}$.

B. Discretize the Data: Choose an integer d. Replace the continuous-valued
quantities $z_{ij}$ by discrete-valued quantities $t_{ij}$ which take on the values $1, 2, \ldots, d$.
This is accomplished by dividing the values in each column of Z into d groups
of equal size $n/d$, that is, we set $t_{ij} = k$ if $(k-1)(n/d) < r_{ij} \le k(n/d)$ where
$r_{ij}$ is the rank of $z_{ij}$ among the values of the $j$-th column of Z. The matrix with
entries $t_{ij}$ will be called T. To avoid complications, we shall always assume
that d divides n exactly. In practice, if we are given a data set not exactly
divisible by d, we simply throw out a few observations (at most $d - 1$) chosen
at random. Since d is small (typically $2 \le d \le 4$), we lose little by doing this.

C. Form a Contingency Table: There are $d^p$ possible p-vectors $\pi = (\pi_1, \pi_2, \ldots, \pi_p)$
with $1 \le \pi_i \le d$ for all $i$. These vectors may be regarded as cells in a
$d \times d \times \cdots \times d$ contingency table. For each cell $\pi$ we compute the cell count $U_\pi$
which we define to be the number of observations (rows of T) which ‘belong’
to $\pi$. More formally, $U_\pi = \#\{i : t_{i\cdot} = \pi\}$ where $t_{i\cdot}$ denotes the $i$-th row of T.

D. Study the Distribution of Cell Counts: Now we study the frequency distribution
of the $d^p$ cell count values $U_\pi$ and compare this with the frequency
distribution expected under the null hypothesis of independence. Let $M_k$ denote
the number of cells containing exactly k observations, that is,
$M_k = \#\{\pi : U_\pi = k\}$. It is clear that

$$\sum_k M_k = d^p \quad\text{and}\quad \sum_k k\,M_k = n. \tag{2.1}$$

The values of $M_k$ which are expected under independence can usually be well
approximated by a Poisson frequency distribution with a mean of $n/d^p$:

$$M_k \approx d^p\,\frac{e^{-\lambda}\lambda^k}{k!}, \qquad \lambda = \frac{n}{d^p}. \tag{2.2}$$

If the observed values of $M_k$ differ greatly from the expected Poisson
frequencies, this is evidence for the existence of ‘interesting’ higher dimensional
structure.

It is intuitively clear that the presence of ‘interesting’ structure in the data
will tend to increase the variability of the cell counts $U_\pi$. This suggests using
the sample variance

$$W = \frac{1}{d^p}\sum_{\pi}\left(U_\pi - \frac{n}{d^p}\right)^2 \tag{2.3}$$

of the cell counts as a test statistic for the existence of structure. This statistic
is proportional to the usual chi-squared statistic for testing independence in
contingency tables. We reject the hypothesis of independence when W is
sufficiently large. We have obtained approximations $\mu_W$ and $\sigma_W^2$ (given later)
for the mean and variance of W under the null hypothesis. These may be used
to conduct rough hypothesis tests based on

$$z = \frac{W - \mu_W}{\sigma_W}. \tag{2.4}$$

These tests must be used with caution; with the usual $\alpha$-levels of .05 or .01
they are likely to detect differences from the null hypothesis which are too
small to be of any practical significance. This is typical of statistical testing in
any situation involving large sample sizes. Rather than testing, it may be more
useful to directly compare the magnitudes of W and $\mu_W$, perhaps in terms of
the ratio $W/\mu_W$.
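
Steps B to D are simple enough to sketch in a few lines. The following Python fragment is only an illustration of the description above, not the program used to produce the output in the Examples section; the function names (`discretize`, `cell_counts`, `frequency_distribution`, `variance_statistic`) are our own hypothetical choices. Only non-empty cells are stored, so the memory needed grows with n rather than with $d^p$.

```python
import random
from collections import Counter


def discretize(Z, d):
    """Step B: replace each column of Z by rank-based group labels 1..d."""
    n, p = len(Z), len(Z[0])
    if n % d != 0:
        raise ValueError("d must divide n; drop a few rows at random first")
    T = [[0] * p for _ in range(n)]
    for j in range(p):
        # Row indices ordered by their value in column j (rank0 = rank - 1).
        order = sorted(range(n), key=lambda i: Z[i][j])
        for rank0, i in enumerate(order):
            # Group k holds the ranks (k-1)n/d + 1, ..., kn/d.
            T[i][j] = rank0 // (n // d) + 1
    return T


def cell_counts(T):
    """Step C: sparse contingency table; only non-empty cells are kept."""
    return Counter(tuple(row) for row in T)


def frequency_distribution(counts, d, p):
    """Step D: M_k = number of cells holding exactly k observations."""
    M = Counter(counts.values())
    M[0] = d ** p - len(counts)  # empty cells are implicit
    return dict(sorted(M.items()))


def variance_statistic(counts, n, d, p):
    """W: mean squared deviation of the d^p cell counts from n / d^p."""
    cells = d ** p
    mean = n / cells
    filled = sum((u - mean) ** 2 for u in counts.values())
    empty = (cells - len(counts)) * mean ** 2
    return (filled + empty) / cells


if __name__ == "__main__":
    rng = random.Random(0)  # assumed seed, for repeatability only
    n, p, d = 1024, 10, 2
    Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    counts = cell_counts(discretize(Z, d))
    print(frequency_distribution(counts, d, p))
    print(variance_statistic(counts, n, d, p))
```

Ties in a column are broken by row order here, which is harmless under the assumption of a continuous joint distribution.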


Transformation  T1 

Step A is obviously the only difficult step. One possible transformation is to replace
the raw data X by the principal components $Z = X\Gamma$. Here $\Gamma$ is a $p \times p$ orthogonal
matrix which diagonalizes the sample covariance matrix $\Sigma$ of X; $\Gamma'\Sigma\Gamma$ is a diagonal
matrix. In the Examples section, we shall refer to this transformation as T1. This
sort of transformation removes the correlation structure in the data and is a common
initial step in many statistical procedures; see, for example, Friedman (1987). In
our situation, the rationale for the transformation is as follows. Suppose there exists
an orthogonal coordinate system in which the coordinates are actually independent,
that is, an observation $X = (X_1, X_2, \ldots, X_p)$ is obtained as $X = YA'$ where A
is an orthogonal matrix and the coordinates $(Y_1, Y_2, \ldots, Y_p)$ of Y are independent.
If $\mathrm{Var}(Y_1) > \mathrm{Var}(Y_2) > \cdots > \mathrm{Var}(Y_p)$ and the sample size n is sufficiently large,
then $\Gamma \approx A$ and the principal components Z will be approximately equal to the
independent coordinates Y. However, if $\mathrm{Var}(Y_i) = \mathrm{Var}(Y_j)$ for some $i \ne j$ (or even
if there is approximate equality), then the principal components Z will not usually
be very close to the values Y. So this transformation does not always succeed.
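
As an illustration of T1, the sketch below computes the principal components with NumPy. It assumes NumPy is available; the function name `t1_principal_components` is ours. The data are not centered because the later rank-based discretization is unaffected by a shift of each column.

```python
import numpy as np


def t1_principal_components(X):
    """Return Z = X @ Gamma, where Gamma diagonalizes the sample covariance."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)         # p x p sample covariance matrix
    eigvals, Gamma = np.linalg.eigh(S)  # columns of Gamma are orthonormal
    order = np.argsort(eigvals)[::-1]   # largest variance first
    return X @ Gamma[:, order]
```

The columns of the returned matrix have zero sample correlation. As noted above, this does not guarantee independence.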

Transformation  T2 

Another transformation which may be useful is to separately transform each of the
marginal distributions (columns of X) to normality, and then further transform the
data (using a linear transformation) so that the variables are uncorrelated. Stating
this in detail we have:

1. Transform each variable (column) to normality. This is done in the usual way.
Let $r_{ij}$ denote the rank of $x_{ij}$ among the values of the $j$-th column (variable).
Replace $x_{ij}$ by its normal score $y_{ij} = \Phi^{-1}((r_{ij} - \tfrac{1}{2})/n)$ where $\Phi$ is the standard
normal distribution function. The $n \times p$ matrix of normal scores $y_{ij}$ will be
denoted by Y. (This use of Y is unrelated to the earlier usage.)

2.  Now  apply  a  linear  transformation  to  Y  chosen  so  that  the  new  variables  Z  are 
uncorrelated  and  have  variance  equal  to  one.  That  is,  choose  a  p  x  p  matrix 
A  such  that  the  transformed  data  Z  =  YA  has  a  sample  covariance  matrix 
equal  to  the  identity  matrix  I .  There  are  many  correct  choices  for  the  matrix 
A. 

In the Examples section, we refer to this transformation as T2. To motivate this
procedure, suppose that we have sampled X from a population which is roughly
similar to the multivariate normal (MVN) distribution (and therefore uninteresting).
In this case, there is reason to hope that the initial transformation to marginal
normality will cause the data to closely resemble a sample from an MVN population.
Then the linear transformation to remove the correlation will convert the data into
what is essentially a sample from the MVN distribution with covariance matrix
equal to the identity, and for this distribution the coordinates are independent as
desired.
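
The two steps of T2 can be sketched in plain Python as follows. This is an illustration under our own assumptions, not the authors' program: the matrix A is taken to be the inverse transpose of the Cholesky factor of the sample covariance of the normal scores (one upper-triangular choice among the many correct ones), the covariance uses the divisor $n - 1$, and all function names are hypothetical.

```python
from statistics import NormalDist


def normal_scores(column):
    """Step 1: replace each value by Phi^{-1}((rank - 1/2) / n)."""
    n = len(column)
    inv_cdf = NormalDist().inv_cdf
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = inv_cdf((rank - 0.5) / n)
    return scores


def sample_covariance(Y):
    """p x p sample covariance matrix (divisor n - 1)."""
    n, p = len(Y), len(Y[0])
    means = [sum(row[j] for row in Y) / n for j in range(p)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in Y) / (n - 1)
             for b in range(p)] for a in range(p)]


def cholesky_lower(S):
    """Lower-triangular L with S = L L^T (S must be positive definite)."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(a + 1):
            s = S[a][b] - sum(L[a][c] * L[b][c] for c in range(b))
            L[a][b] = s ** 0.5 if a == b else s / L[b][b]
    return L


def t2_transform(X):
    """Step 2: Z = Y A with A = (L^T)^{-1}, an upper-triangular matrix."""
    n, p = len(X), len(X[0])
    cols = [normal_scores([row[j] for row in X]) for j in range(p)]
    Y = [[cols[j][i] for j in range(p)] for i in range(n)]
    L = cholesky_lower(sample_covariance(Y))
    Z = []
    for y in Y:
        # Forward substitution solves L z = y, i.e. z^T = y^T (L^T)^{-1}.
        z = []
        for a in range(p):
            z.append((y[a] - sum(L[a][c] * z[c] for c in range(a))) / L[a][a])
        Z.append(z)
    return Z
```

With this choice the sample covariance matrix of Z equals the identity up to rounding error.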

The previous two transformations are of general utility. However, in some cases
one may have to tailor the transformation to the particular data set. For example,
if a bivariate scatter plot of variable i ($X_i$) versus variable j ($X_j$) reveals a curved,
approximately quadratic relationship between $X_i$ and $X_j$, then a reasonable
transformation might replace $X_i$ by the residuals from the regression of $X_i$ on $X_j$ and
$X_j^2$. This transformation would remove the curvature from the scatter plot.

3  Examples

Sampling  from  the  Multivariate-Normal  Distribution 

The  MVN  distribution  is  surely  the  most  boring  distribution.  We  consider  it  boring 
because  a  linear  transformation  converts  it  into  a  distribution  with  independent 
coordinates.  Researchers  studying  projection  pursuit  methods  consider  it  boring  for 
different  reasons  (see  Huber  (1985)).  We  shall  use  the  MVN  distribution  to  study 
the  null  behavior  of  our  methods. 

In our first example, X is a 1024 x 10 matrix composed of independent columns
generated from a standard normal distribution. This data has no structure of any
sort. If we suspected this in advance, we would skip the transformation (Step A)
and just carry out the remaining steps B to D. We have a simple computer program
which carries out steps B to D. In this case we obtain the output:

For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8
Observed  377.00  387.00  169.00   73.00    11.0    6.00    1.00    0.00    0.00
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1    1.0332   1.09910   1.35996
Expected         1    0.9893   0.98396   0.94713

The z-score for the variance of cell counts = 1.004


We have chosen to set d = 2; the computer output reports this as the ‘number
of cuts’. This means we have divided the data space into $d^p = 2^{10} = 1024$ cells.
There are also n = 1024 observations (rows of X), so that the average number of
observations per cell is $n/d^p = 1$. The output lists the ‘observed’ frequency
distribution: 377 cells are empty, 387 cells contain exactly 1 observation, 169 cells contain
exactly 2 observations, etc. The ‘expected’ frequency distribution is computed using
the Poisson approximation mentioned earlier. The observed and expected frequency
distributions are quite close; the differences are what one would expect from random
variation.

The ‘observed’ moments are computed from the observed frequency distribution.
(The skewness and kurtosis have been standardized by the variance in the
usual way.) These can be compared with the ‘expected’ moments, the true moments
under the assumption of independence. These ‘expected’ moments are not
based on the Poisson approximation; they are the exact moments. The formulas
for these moments are given in the next section. To aid in comparing the observed
and expected variance (1.0332 versus 0.9893), we have computed the z-score, which
attained the modest value of 1.004.
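
The ‘Expected’ row and the ‘Observed’ moments in this output can be checked by hand. The sketch below (our own illustration, with hypothetical function names) evaluates the Poisson frequencies $d^p e^{-\lambda}\lambda^k/k!$ with $\lambda = n/d^p$ and computes the moments of the observed frequency table, taking kurtosis to mean excess kurtosis.

```python
import math


def poisson_expected(n, d, p, kmax):
    """Expected number of cells holding k observations, for k = 0..kmax."""
    cells = d ** p
    lam = n / cells
    return [cells * math.exp(-lam) * lam ** k / math.factorial(k)
            for k in range(kmax + 1)]


def moments_from_frequencies(M):
    """M[k] = number of cells with k observations.

    Returns (mean, variance, skewness, excess kurtosis) of the cell counts.
    """
    cells = sum(M)
    mean = sum(k * m for k, m in enumerate(M)) / cells

    def central(r):
        return sum(m * (k - mean) ** r for k, m in enumerate(M)) / cells

    var = central(2)
    return mean, var, central(3) / var ** 1.5, central(4) / var ** 2 - 3.0


observed = [377, 387, 169, 73, 11, 6, 1, 0, 0]  # first example, d = 2
print([round(e, 2) for e in poisson_expected(1024, 2, 10, 8)])
print([round(v, 5) for v in moments_from_frequencies(observed)])
```

The printed values agree with the table above to the number of digits reported there.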

In summary, for this example our procedure has found no evidence of
dependence/structure. This agrees with the known truth in this case.

Suppose we did not suspect a priori that this data set had independent columns.
We would then have performed some type of transformation in Step A. Using either
T1 or T2 would still lead to the conclusion that no structure is visible in this data.
For example, using T2 (with the matrix A chosen to be upper triangular) leads to
the output:


For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8
Observed  367.00  396.00  182.00   62.00    10.0    5.00    1.00    1.00    0.00
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1   1.00391   1.20582   2.37976
Expected         1   0.98930   0.98396   0.94713

The z-score (conservative) for the variance of cell counts = .334
Upper bound for z-score = 1.37


There is a discrepancy between the observed and expected kurtosis, but this means
little as the sample kurtosis is highly variable. Note that we now report two different
z-scores. The transformation T2 makes the columns of Z (the transformed data)
have sample correlations exactly equal to zero, which would not occur if the columns
of Z were actually independent. This makes the variance of the cell counts somewhat
smaller than would occur under the assumption of independence. Thus the usual
z-score is conservative. A crude and heuristic degrees of freedom correction is used to
obtain a larger z-score (reported in the output as the ‘upper bound’) which appears
in simulations to be ‘liberal’. For details see Section 5. In practice, the difference
between these two z-scores is of little importance, for with a large sample size n, one
typically does not reject the null hypothesis unless both z-scores are quite large.

As a variation on the previous example, we now consider a data set with
correlated columns. The data matrix X is still 1024 x 10. Observations are sampled
from a MVN distribution having a correlation of .2 between all pairs of covariates.
If we neglect to transform this data, but apply steps B-D directly to the raw data,
we obtain the following results.

For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8
Observed  431.00  358.00  154.00   45.00    16.0    6.00    8.00    1.00    1.00
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01

           9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
Observed   0   0   1   1   0   0   0   0   0   1   0   0   0   0   0   0   1
Expected   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1   2.32031   6.50023  79.12713
Expected         1   0.98930   0.98396   0.94713

The z-score for the variance of cell counts = 30.441


Even  a  rather  modest  correlation  of  .2  has  dramatically  altered  the  frequency 
distribution  of  the  cell  counts;  a  cell  count  of  18  or  25  is  essentially  impossible  under 
independence.  Thus  our  procedure  loudly  proclaims  that  there  is  structure  in  this 
data.  However,  the  only  structure  in  the  data  is  the  correlation  structure  which  is 
removed  using  either  transformation  T1  or  T2.  Using  T2  (with  upper  triangular  A) 
gives  us: 


For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8
Observed  393.00  351.00  189.00   71.00    18.0    2.00    0.00    0.00    0.00
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1   1.03516   0.90684   0.36108
Expected         1   0.98930   0.98396   0.94713

The z-score (conservative) for the variance of cell counts = 1.049
Upper bound for z-score = 2.101


This  output  indicates  that  no  structure  remains  in  the  transformed  data. 


Examples  with  Randomly  Located  Clusters 

We now consider two examples where the data consists of many randomly located
clusters. We take X to be 1024 x 10 in both cases. The observations in X are made
up of m clusters. The cluster centers (denoted $\mu_1, \mu_2, \ldots, \mu_m$) are independently
generated from a MVN(0, I) distribution. The members of cluster i are independently
generated from a MVN($\mu_i$, $\sigma$I) distribution. Here I is the 10 x 10 identity
matrix. In both examples, the value of $\sigma$, which controls the size of the clusters,
has been made large enough so that there is little evidence of clustering (or other
structure) visible in the bivariate scatter plots. A careless data analyst might easily
conclude there is no structure in the data.

Our  analysis  is  given  below.  In  both  cases,  the  data  has  been  transformed  using 
T2. 

The data in our first example is composed of 256 clusters, each containing 4
observations. The value of $\sigma$ is 0.2. The output for d = 2 is:

For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8
Observed  538.00  225.00  105.00   80.00    48.0   19.00    3.00    4.00    2.00
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1   1.96875   1.64175   2.67800
Expected         1   0.98930   0.98396   0.94713

The z-score (conservative) for the variance of cell counts = 22.401
Upper bound for z-score = 23.944


This  clearly  signals  the  existence  of  some  type  of  structure  in  the  data. 

By making d larger, we can check for the existence of clustering or nonuniformity
on a smaller scale. With d = 4, the number of cells in our contingency table is
$4^{10} = 1048576$. Storing a complete contingency table this large would require too
much memory. However, since n = 1024, the vast majority of these cells are empty.
Because there is no need to keep track of the empty cells, the amount of memory
we require is not excessive. Carrying out our analysis on the same data using d = 4
leads to:

For the number of cuts = 4,

The frequency distribution of the cell counts is:

                0        1        2        3
Observed  1047634      868     66.0        8
Expected  1047552     1023      0.5        0

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed   0.00098   0.00115  39.47442  1885.986
Expected   0.00098   0.00098  31.99860  1023.851

The z-score (conservative) for the variance of cell counts = 128.53
Upper bound for z-score = 128.565


This output shows that the number of cells containing 2 or 3 observations is much
larger than one would expect under independence. This concludes our discussion of
the first example.

The data in our second example is composed of 16 clusters, each containing 64
observations. The value of $\sigma$ is 0.7. The output for d = 2 is:

For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8    9   10
Observed  473.00  301.00  134.00   61.00    30.0   14.00    4.00    3.00    1.00    1    2
Expected  376.71  376.71  188.35   62.78    15.7    3.14    0.52    0.07    0.01    0    0

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed         1   1.78516   2.17656   7.29766
Expected         1   0.98930   0.98396   0.94713

The z-score (conservative) for the variance of cell counts = 18.202
Upper bound for z-score = 19.648


Again,  our  analysis  clearly  shows  that  structure  exists  in  this  data. 

The  weakness  of  the  method  in  both  examples  is  that  it  gives  no  clear  indication 
of  the  nature  of  the  structure  which  is  detected. 


An  Example  Using  Speech  Data 

The data matrix X in this example is 1507 x 10. The data was obtained by sampling
from a much larger matrix of digitized speech data consisting of 10-dimensional ‘lpc’
vectors. The lpc vectors in this sample all correspond to ‘unvoiced’ sounds.

Suppose the object of our analysis is to find out if the ‘unvoiced’ lpc vectors tend
to lie in clusters. Examination of the bivariate scatter plots reveals some structure in
the data (curvature and heteroscedasticity), but no evidence of clustering. Applying
transformation T1 (principal components) to this data and then carrying out our
analysis for d = 4 leads to the following output:

For the number of cuts = 4,

The frequency distribution of the cell counts is:

                0        1     2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Observed  1047111  1463.00  0.00   0   0   0   0   0   0   1   0   0   0   0   0   0   0
Expected  1047070  1504.84  1.08   0   0   0   0   0   0   0   0   0   0   0   0   0   0

          17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
Observed   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
Expected   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The moments of the distribution of cell counts are:

              mean  variance   skewness     kurtosis
Observed   0.00144   0.00264  317.00704  206605.0732
Expected   0.00144   0.00144   26.37693     695.7016

The z-score (conservative) for the variance of cell counts = 611.618
Upper bound for z-score = 611.663


Examining the frequency distribution, we see there is one cell which contains 35
observations!  There  is  another  cell  which  contains  9  observations.  If  the  coordinates 
were  in  fact  independent,  one  would  not  expect  to  see  any  cells  with  more  than  2 
observations,  so  these  two  cells  must  be  regarded  as  quite  unusual.  A  little  detective 
work  reveals  that  these  two  cells  are  neighboring  cells  lying  close  to  the  center  of  the 
data  set.  Thus,  this  data  set  contains  a  small,  but  relatively  dense  cluster  (containing 
around  44  =  35  +  9  observations)  near  its  center.  In  terms  of  speech,  I  do  not  know 
what  this  cluster  corresponds  to.  It  may  turn  out  to  be  of  no  importance. 

Are there any other clusters in the data? In order to investigate this question,
all but one (43 out of 44) of the observations in the two unusual cells were deleted
from the data set. Reanalyzing the data (using transformation T1 and d = 2) leads
to the following:

For the number of cuts = 2,

The frequency distribution of the cell counts is:

               0       1       2       3       4       5       6       7       8       9
Observed  278.00  323.00  230.00  127.00   38.00    22.0    5.00    0.00    1.00    0.00
Expected  245.13  350.46  250.52  119.39   42.67    12.2    2.91    0.59    0.11    0.02

The moments of the distribution of cell counts are:

              mean  variance  skewness  kurtosis
Observed   1.42969   1.65521   0.96481   1.00909
Expected   1.42969   1.41437   0.82291   0.66243

The z-score (conservative) for the variance of cell counts = 3.847
Upper bound for z-score = 4.962

This suggests that some structure exists (which we already know from examining
the bivariate scatter plots), but that there is no clustering which is as pronounced
or dramatic as that found in the earlier simulated examples. If any clusters remain
in this data set, they are not very well defined.

4  The  Null  Distribution 

In this section we shall give expressions for the moments and distribution of $U_\pi$,
the number of observations contained in cell $\pi$. We also present a formula for the
variance of the quantity W defined in equation 2.3. All these results are derived
under the assumption that the coordinates are independent; under this assumption
the results are exact. If a data-dependent transformation (such as T1 or T2) has
been applied to the data, the results can only be regarded as approximations. Proofs
for these results are given in an appendix.

The following notation will be useful. For any real number x and positive integer
k we define

$$(x)_k = \prod_{j=0}^{k-1}(x - j) \tag{4.1}$$

so that, for instance, the combinatorial coefficient $\binom{x}{k}$ may be written as $(x)_k/k!$.

Let $\theta_k$ be the probability that the first k observations belong to the cell $\pi$. More
formally,

$$\theta_k = \Pr\{t_{i\cdot} = \pi \text{ for } 1 \le i \le k\}$$

where $t_{i\cdot}$ denotes the $i$-th row of T. Using urn model arguments (sampling without
replacement) it is easy to show that

$$\theta_k = \left[\frac{(n/d)_k}{(n)_k}\right]^p.$$

The factorial moments of $U_\pi$ can now be given as

$$E(U_\pi)_k = (n)_k\,\theta_k. \tag{4.2}$$

The ordinary moments (about the origin) can be obtained directly from the factorial
moments. The general formula is given in the appendix. As special cases we note
that

$$\begin{aligned}
EU_\pi &= n\theta_1 = \frac{n}{d^p} \\
EU_\pi^2 &= n\theta_1 + n(n-1)\theta_2 \\
EU_\pi^3 &= n\theta_1 + 3n(n-1)\theta_2 + n(n-1)(n-2)\theta_3 \\
EU_\pi^4 &= n\theta_1 + 7n(n-1)\theta_2 + 6n(n-1)(n-2)\theta_3 + n(n-1)(n-2)(n-3)\theta_4
\end{aligned}$$

Finally, the central moments may be obtained from the moments about the origin
via the formula

$$E(U_\pi - \mu)^k = \sum_{j=0}^{k}\binom{k}{j}(-\mu)^{k-j}\,EU_\pi^{\,j}, \qquad \mu = EU_\pi.$$

This is the route used to calculate the moments reported in the computer output in
the Examples section. The skewness and kurtosis are, as usual, defined by

$$\text{skewness} = \frac{E(U_\pi - \mu)^3}{\sigma^3}$$

and

$$\text{kurtosis} = \frac{E(U_\pi - \mu)^4}{\sigma^4} - 3$$

where $\mu$ and $\sigma^2$ are the mean and variance of $U_\pi$.
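
As a check on these formulas, the following sketch evaluates the exact mean, variance, skewness and kurtosis of a cell count from the factorial moments $E(U_\pi)_k = (n)_k\theta_k$, with $\theta_k = [(n/d)_k/(n)_k]^p$. It is an illustration only: the function names are hypothetical and the code assumes that d divides n and that $n \ge 4$.

```python
def falling(x, k):
    """Falling factorial (x)_k = x (x-1) ... (x-k+1), with (x)_0 = 1."""
    out = 1.0
    for j in range(k):
        out *= x - j
    return out


def theta(k, n, d, p):
    """Probability that k given rows of T all fall in one fixed cell."""
    return (falling(n // d, k) / falling(n, k)) ** p


def exact_cell_moments(n, d, p):
    """Return (mean, variance, skewness, excess kurtosis) of a cell count."""
    # f[k] is the k-th factorial moment E (U)_k = (n)_k * theta_k.
    f = [falling(n, k) * theta(k, n, d, p) for k in range(5)]
    m1 = f[1]
    m2 = f[1] + f[2]
    m3 = f[1] + 3 * f[2] + f[3]
    m4 = f[1] + 7 * f[2] + 6 * f[3] + f[4]
    var = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return m1, var, mu3 / var ** 1.5, mu4 / var ** 2 - 3.0


print(exact_cell_moments(1024, 2, 10))
```

For n = 1024, d = 2 and p = 10 this agrees with the ‘Expected’ moments row of the first example (1, 0.9893, 0.98396, 0.94713) to the digits shown there.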

The exact distribution of the cell counts is given by

$$P(U_\pi = k) = \binom{n}{k}\sum_{j=0}^{n-k}(-1)^j\binom{n-k}{j}\,\theta_{k+j} \tag{4.3}$$

where $\theta_0 = 1$. In practice, the series is truncated when the terms become sufficiently
small. This series should give accurate results when the terms decay to zero rapidly,
but the alternating signs may render the formula useless when the rate of decay is
slow. The expected frequency distribution reported in the computer output in the
Examples section was obtained from the Poisson approximation in (2.2).
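
For small problems the alternating series can be summed without any rounding error by using exact rational arithmetic. The sketch below reads the exact distribution as $P(U_\pi = k) = \binom{n}{k}\sum_j (-1)^j\binom{n-k}{j}\theta_{k+j}$ with $\theta_0 = 1$. It is an illustration only (function names are ours) and it becomes slow for large n, where floating-point summation with truncation would be the practical choice.

```python
from fractions import Fraction
from math import comb


def falling_int(x, k):
    """Integer falling factorial (x)_k, with (x)_0 = 1."""
    out = 1
    for j in range(k):
        out *= x - j
    return out


def theta_exact(k, n, d, p):
    """Exact probability that k given rows all fall in one fixed cell."""
    return Fraction(falling_int(n // d, k), falling_int(n, k)) ** p


def exact_cell_pmf(k, n, d, p):
    """P(U = k) by inclusion-exclusion, as an exact Fraction."""
    total = Fraction(0)
    for j in range(n - k + 1):
        t = theta_exact(k + j, n, d, p)
        if t == 0:  # all later terms vanish, since k + j already exceeds n/d
            break
        total += (-1) ** j * comb(n - k, j) * t
    return comb(n, k) * total


pmf = [exact_cell_pmf(k, 8, 2, 3) for k in range(9)]
print([str(q) for q in pmf])
```

The probabilities sum to one and have mean $n/d^p$, which gives a quick sanity check on any implementation.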

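We now show how to compute the quantities $\mu_W$ and $\sigma_W^2$ occurring in equation
2.4 for the z-scores. We shall freely use the ordinary and factorial moments which
are easily calculated using the previously given formulas. First the mean:

$$\mu_W = EU_\pi^2 - (EU_\pi)^2. \tag{4.4}$$

Now for the variance. Define

$$Q_{jk} = (n)_{j+k}\left\{\frac{1}{d}\,\frac{(n/d)_{j+k}}{(n)_{j+k}} + \left(1 - \frac{1}{d}\right)\frac{(n/d)_j\,(n/d)_k}{(n)_{j+k}}\right\}^p.$$

Then we can write

$$\mathrm{Var}(W) = Q_{22} + 2Q_{12} + Q_{11} - (EU_\pi^2)^2 + \frac{1}{d^p}\left\{4E(U_\pi)_3 + 6E(U_\pi)_2 + EU_\pi\right\}. \tag{4.5}$$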

This expression is easy to compute with, but rather difficult to comprehend. When
n and $d^p$ are both large, useful approximations are given by

$$\mu_W \approx \frac{n}{d^p} \quad\text{and}\quad \sigma_W^2 \approx \frac{2n^2}{d^{3p}}. \tag{4.6}$$

Note that $W = (n/d^{2p})\chi^2$ where $\chi^2$ is the usual test statistic for independence in
contingency tables. Thus (4.6) corresponds to the fact that the variance of a $\chi^2$
random variable is twice its mean. These formulas are based upon $d^p$ degrees of
freedom and will somewhat overstate the actual mean and variance.
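
The null mean and standard deviation of W, and hence the conservative z-score $(W - \mu_W)/\sigma_W$, can be computed directly. The sketch below takes $\mu_W = EU_\pi^2 - (EU_\pi)^2$ and $\mathrm{Var}(W) = Q_{22} + 2Q_{12} + Q_{11} - (EU_\pi^2)^2 + d^{-p}\{4E(U_\pi)_3 + 6E(U_\pi)_2 + EU_\pi\}$, where $Q_{jk} = (n)_{j+k}\{[(n/d)_{j+k}/d + (1 - 1/d)(n/d)_j(n/d)_k]/(n)_{j+k}\}^p$. It is an illustration with hypothetical function names, and it assumes that d divides n and that $n \ge 4$.

```python
def falling(x, k):
    """Falling factorial (x)_k = x (x-1) ... (x-k+1), with (x)_0 = 1."""
    out = 1.0
    for j in range(k):
        out *= x - j
    return out


def w_null_moments(n, d, p):
    """Return (mu_W, sigma_W) under independence of the coordinates."""
    m = n // d  # number of rows in each of the d groups of a column

    def fact_moment(k):
        # E (U)_k = (n)_k * theta_k with theta_k = [(n/d)_k / (n)_k]^p
        return falling(n, k) * (falling(m, k) / falling(n, k)) ** p

    def Q(j, k):
        denom = falling(n, j + k)
        inner = (falling(m, j + k) / d
                 + (1 - 1 / d) * falling(m, j) * falling(m, k)) / denom
        return denom * inner ** p

    eu, f2, f3 = fact_moment(1), fact_moment(2), fact_moment(3)
    eu2 = eu + f2                       # E U^2
    mu_w = eu2 - eu ** 2
    var_w = (Q(2, 2) + 2 * Q(1, 2) + Q(1, 1) - eu2 ** 2
             + (4 * f3 + 6 * f2 + eu) / d ** p)
    return mu_w, var_w ** 0.5


def z_score(W, n, d, p):
    """Conservative z-score (W - mu_W) / sigma_W."""
    mu_w, sigma_w = w_null_moments(n, d, p)
    return (W - mu_w) / sigma_w


print(w_null_moments(1024, 2, 10))
print(z_score(2082 / 1024 - 1, 1024, 2, 10))  # observed W of the first example
```

For the first example in the Examples section (observed variance 1.0332) this gives a z-score of about 1.004, in agreement with the reported output.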


5  The  Upper  Bound  for  z-scores 

As noted earlier, if the data has been transformed (perhaps by using T1 or T2) to
remove all correlations, the z-scores computed from (2.4) will be conservative (too
small). The following correction tends to produce a z-score which is liberal (too
large). The two z-scores give us a likely range of values which can be used to carry
out crude hypothesis tests.

Define

$$r = \frac{d^p - p(d-1) - 1 - \binom{p}{2}}{d^p - p(d-1) - 1}.$$

The liberal z-score is given by

$$z = \frac{W - r\mu_W}{\sigma_W\sqrt{r}}. \tag{5.1}$$

The values of $\mu_W$ and $\sigma_W$ are those defined in the previous section. Note that $\binom{p}{2}$
is the number of correlations which are estimated, that is, removed from the data,
and $d^p - p(d-1) - 1$ is the number of degrees of freedom in our $d \times d \times \cdots \times d$
contingency table.
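
The correction is a one-line computation once $\mu_W$ and $\sigma_W$ are known. The sketch below reads the correction factor as $r = [d^p - p(d-1) - 1 - \binom{p}{2}]/[d^p - p(d-1) - 1]$ and the liberal z-score as $(W - r\mu_W)/(\sigma_W\sqrt{r})$; this reading reproduces the ‘upper bound’ values in the Examples section. The function name is hypothetical, and the value 0.04372 used for $\sigma_W$ below is our own rounded figure for n = 1024, d = 2, p = 10.

```python
from math import comb, sqrt


def liberal_z(W, mu_w, sigma_w, d, p):
    """Upper-bound z-score after removing all p(p-1)/2 correlations."""
    df = d ** p - p * (d - 1) - 1  # degrees of freedom of the table
    r = (df - comb(p, 2)) / df     # fraction of them left after the removal
    return (W - r * mu_w) / (sigma_w * sqrt(r))


# Independent MVN example analysed with T2: W = 1.00391, mu_W = 0.98930.
print(round(liberal_z(1.00391, 0.98930, 0.04372, 2, 10), 2))
```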

Appendix


In  section  2  (step  B),  we  defined  the  n  x  p  matrix  T.  Under  the  assumption  that 
the  coordinates  are  independent,  it  is  clear  that  the  columns  of  T  are  independent. 
In  the  process  of  constructing  T,  we  assume  that  d  divides  n  exactly.  Thus  each 
column  consists  of  values  {1, 2, . . . ,  d}  with  n/d  repetitions  of  each  value,  and  entries 
in  a  column  are  exchangeable. 

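Using urn model arguments (sampling without replacement) and the independence of the columns, the probability that $k$ given rows $t_1, \ldots, t_k$ of $T$ all fall in a fixed cell $\pi$ is

$$P(t_1 = \cdots = t_k = \pi) = \left\{ \frac{(n/d)_k}{(n)_k} \right\}^p. \tag{A.1}$$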


Let $C_k^n$ be the class of subsets of $\{1, 2, \ldots, n\}$ with cardinality $k$, i.e.

$$C_k^n = \{\sigma \subset \{1, 2, \ldots, n\} : |\sigma| = k\}$$

for $k = 1, 2, \ldots, n$, where $|\sigma|$, the cardinality of $\sigma$, is the number of elements in the set $\sigma$. Then it is easy to show the relation

$$\binom{U_\pi}{k} = \sum_{\sigma \in C_k^n} I(t_i = \pi,\ \forall i \in \sigma), \tag{A.2}$$


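where $t_i$ denotes the $i$-th row of $T$. By exchangeability of rows, we have

$$E\binom{U_\pi}{k} = \binom{n}{k} \left\{ \frac{(n/d)_k}{(n)_k} \right\}^p. \tag{A.3}$$

From (A.3), the factorial moments of $U_\pi$ become

$$E(U_\pi)_k = (n)_k \left\{ \frac{(n/d)_k}{(n)_k} \right\}^p.$$

The ordinary moments (about the origin) can be obtained directly from the factorial moments. This yields

$$E U_\pi^k = \sum_{j=1}^{k} S(k, j)\, E(U_\pi)_j,$$

where $S(k, j)$ is a Stirling number of the second kind; see Abramowitz and Stegun (1970) for definitions and tables of these numbers.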
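The conversion from factorial to ordinary moments can be sketched in a few lines. The function names are hypothetical, the Stirling numbers are generated from the standard recurrence $S(k,j) = j\,S(k-1,j) + S(k-1,j-1)$ rather than read from tables, and $d$ is assumed to divide $n$ with $k \le n$.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial (x)_k, with (x)_0 = 1."""
    out = 1
    for i in range(k):
        out *= x - i
    return out

def stirling2(k, j):
    """Stirling number of the second kind via the standard recurrence."""
    if k == j:
        return 1
    if j == 0 or j > k:
        return 0
    return j * stirling2(k - 1, j) + stirling2(k - 1, j - 1)

def fact_moment(n, d, p, k):
    """E(U_pi)_k = (n)_k * [(n/d)_k / (n)_k]^p."""
    m = n // d  # assumes d divides n exactly
    return falling(n, k) * Fraction(falling(m, k), falling(n, k)) ** p

def ordinary_moment(n, d, p, k):
    """E U_pi^k as a Stirling-weighted sum of factorial moments."""
    return sum(stirling2(k, j) * fact_moment(n, d, p, j) for j in range(k + 1))

if __name__ == "__main__":
    print(ordinary_moment(12, 3, 2, 2))
```

For $p = 1$ the cell count is the constant $n/d$, so the $k$-th ordinary moment must equal $(n/d)^k$, which is a convenient check.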

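To derive the exact distribution of $U_\pi$ we can use the relation

$$I(U_\pi = k) = \sum_{\sigma \in C_k^n} I(t_i = \pi,\ \forall i \in \sigma;\ t_i \ne \pi,\ \forall i \notin \sigma).$$

Thus by exchangeability of rows, we have

$$P(U_\pi = k) = \binom{n}{k} P(t_i = \pi,\ 1 \le i \le k;\ t_i \ne \pi,\ k+1 \le i \le n)$$

$$= \binom{n}{k} E\left[ \prod_{i=1}^{k} I(t_i = \pi) \prod_{i=k+1}^{n} \{1 - I(t_i = \pi)\} \right]$$

$$= \binom{n}{k} \sum_{j=0}^{n-k} (-1)^j \binom{n-k}{j} \left\{ \frac{(n/d)_{k+j}}{(n)_{k+j}} \right\}^p. \tag{A.6}$$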
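A minimal sketch of (A.6) in exact arithmetic follows. The helper name `theta` for the probability in (A.1) is hypothetical, and $d$ is assumed to divide $n$.

```python
from fractions import Fraction
from math import comb

def falling(x, k):
    """Falling factorial (x)_k, with (x)_0 = 1."""
    out = 1
    for i in range(k):
        out *= x - i
    return out

def theta(n, d, p, i):
    """P(t_1 = ... = t_i = pi) = [(n/d)_i / (n)_i]^p, as in (A.1)."""
    m = n // d  # assumes d divides n exactly
    return Fraction(falling(m, i), falling(n, i)) ** p

def pmf_U(n, d, p, k):
    """P(U_pi = k) by the inclusion-exclusion sum in (A.6)."""
    return comb(n, k) * sum(
        (-1) ** j * comb(n - k, j) * theta(n, d, p, k + j)
        for j in range(n - k + 1)
    )

if __name__ == "__main__":
    print([pmf_U(4, 2, 2, k) for k in range(5)])
```

Terms with $k + j > n/d$ vanish automatically because $(n/d)_{k+j} = 0$ there, so the sum is effectively finite well before $j = n - k$.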


We now show how to derive the variance of $W$ in (2.3). First, we will simplify the formula for $\mathrm{Var}(W)$:

$$\mathrm{Var}(W) = E\left(U_{\Pi_1}^2 U_{\Pi_2}^2\right) - \left(E U_\pi^2\right)^2, \tag{A.7}$$

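where $\Pi_1, \Pi_2$ are iid uniform random variables on the set of all possible cells. Now we can derive the relation

$$\sum I(t_i = \Pi_1,\ \forall i \in \sigma_1;\ t_i = \Pi_2,\ \forall i \in \sigma_2) = \binom{U_{\Pi_1}}{j}\binom{U_{\Pi_2}}{k} + \delta_{\Pi_1,\Pi_2}\left\{ \binom{j+k}{j}\binom{U_{\Pi_1}}{j+k} - \binom{U_{\Pi_1}}{j}\binom{U_{\Pi_1}}{k} \right\}, \tag{A.8}$$

where $\delta_{\Pi_1,\Pi_2}$ is the Kronecker delta, and the summation on the left hand side is over all $\sigma_1 \in C_j^n$, $\sigma_2 \in C_k^n$ with $\sigma_1 \cap \sigma_2 = \emptyset$. First, take expectation on the left hand side of equation (A.8). Then by exchangeability of rows, it becomes

$$\frac{(n)_{j+k}}{j!\,k!}\, P(t_i = \Pi_1,\ 1 \le i \le j;\ t_i = \Pi_2,\ j+1 \le i \le j+k). \tag{A.9}$$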

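Using the independence of columns of $T$, and using the fact that if $\Pi_1$ and $\Pi_2$ are iid uniform random variables, then the coordinates of $\Pi_1$ are independent of those of $\Pi_2$, we can show that (A.9) becomes

$$\frac{(n)_{j+k}}{j!\,k!} \left\{ \frac{1}{d}\,\frac{(n/d)_{j+k}}{(n)_{j+k}} + \left(1 - \frac{1}{d}\right) \frac{(n/d)_j\,(n/d)_k}{(n)_{j+k}} \right\}^p. \tag{A.10}$$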
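Now take expectation on the right hand side of equation (A.8) to get

$$E\left[ \binom{U_{\Pi_1}}{j}\binom{U_{\Pi_2}}{k} \right] + \frac{1}{d^p}\, E\left[ \binom{j+k}{j}\binom{U_\pi}{j+k} - \binom{U_\pi}{j}\binom{U_\pi}{k} \right]. \tag{A.11}$$

Define

$$Q_{jk} = (n)_{j+k} \left\{ \frac{1}{d}\,\frac{(n/d)_{j+k}}{(n)_{j+k}} + \left(1 - \frac{1}{d}\right) \frac{(n/d)_j\,(n/d)_k}{(n)_{j+k}} \right\}^p.$$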

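By combining (A.10) and (A.11) and then multiplying both sides by $j!\,k!$, we get

$$E\{(U_{\Pi_1})_j (U_{\Pi_2})_k\} + \frac{1}{d^p}\, E\{(U_\pi)_{j+k} - (U_\pi)_j (U_\pi)_k\} = Q_{jk},$$

that is,

$$E\{(U_{\Pi_1})_j (U_{\Pi_2})_k\} = Q_{jk} + \frac{1}{d^p}\, E\{(U_\pi)_j (U_\pi)_k - (U_\pi)_{j+k}\}. \tag{A.12}$$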


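Thus

$$\mathrm{Var}(W) = E\left(U_{\Pi_1}^2 U_{\Pi_2}^2\right) - \left(E U_\pi^2\right)^2$$

$$= E\{(U_{\Pi_1})_2 (U_{\Pi_2})_2 + 2U_{\Pi_1}(U_{\Pi_2})_2 + U_{\Pi_1}U_{\Pi_2}\} - \left(E U_\pi^2\right)^2$$

$$= Q_{22} + 2Q_{12} + Q_{11} - \left(E U_\pi^2\right)^2 + \frac{1}{d^p}\, E\{(U_\pi)_2 (U_\pi)_2 + 2U_\pi (U_\pi)_2 + U_\pi^2 - (U_\pi)_4 - 2(U_\pi)_3 - (U_\pi)_2\}$$

$$= Q_{22} + 2Q_{12} + Q_{11} - \left(E U_\pi^2\right)^2 + \frac{1}{d^p}\, E\{U_\pi^4 - (U_\pi)_4 - 2(U_\pi)_3 - (U_\pi)_2\}$$

$$= Q_{22} + 2Q_{12} + Q_{11} - \left(E U_\pi^2\right)^2 + \frac{1}{d^p}\left\{4E(U_\pi)_3 + 6E(U_\pi)_2 + E U_\pi\right\}. \tag{A.13}$$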
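For very small tables the variance can also be checked by brute force. The sketch below enumerates every equally likely arrangement of each column of $T$ and computes the variance of $d^{-p}\sum_\pi U_\pi^2$ directly. It assumes that $W$ differs from this quantity by at most an additive constant, which does not affect the variance; the function name is hypothetical and the enumeration is only feasible for tiny $n$.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product

def brute_force_var_W(n, d, p):
    """Exact Var(W) by enumerating all column arrangements (tiny n only)."""
    # Each column holds the values 0..d-1, each repeated n/d times.
    base = [v for v in range(d) for _ in range(n // d)]
    # Distinct arrangements of a column are equally likely.
    columns = sorted(set(permutations(base)))
    w_values = []
    for cols in product(columns, repeat=p):
        counts = Counter(zip(*cols))  # rows of T mapped to cell counts
        w_values.append(Fraction(sum(c * c for c in counts.values()), d ** p))
    mean = sum(w_values) / len(w_values)
    return sum(w * w for w in w_values) / len(w_values) - mean ** 2

if __name__ == "__main__":
    print(brute_force_var_W(4, 2, 2))
```

For $n = 4$, $d = 2$, $p = 2$ the enumeration gives $2/9$, which agrees with the value obtained by substituting into (A.13) by hand.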

The statistic $W$ is proportional to the usual $\chi^2$ test statistic for independence in contingency tables. When $p = 2$, our formula for $\mathrm{Var}(W)$ is equivalent to a special case of the Haldane-Dawson formula (Haldane, 1939; Dawson, 1954) for the variance of $\chi^2$.

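In order to calculate an approximate value of $\mathrm{Var}(W)$ for large $n$ and $d^p$, we can use the following approximation: for large $y$ and for small $k$ and $j$,

$$\frac{(y)_{j+k}}{(y)_j (y)_k} = \frac{(y - j)_k}{(y)_k} = \prod_{i=0}^{k-1} \left(1 - \frac{j}{y - i}\right) \approx 1 - \frac{jk}{y}. \tag{A.14}$$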
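The quality of this approximation is easy to check numerically. The sketch below (hypothetical function names) compares the exact falling-factorial ratio with $1 - jk/y$, on the assumption that this is the intended first-order form of the approximation.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial (x)_k, with (x)_0 = 1."""
    out = 1
    for i in range(k):
        out *= x - i
    return out

def ratio(y, j, k):
    """Exact value of (y)_{j+k} / ((y)_j (y)_k)."""
    return Fraction(falling(y, j + k), falling(y, j) * falling(y, k))

def first_order(y, j, k):
    """First-order approximation 1 - jk/y."""
    return 1 - Fraction(j * k, y)

if __name__ == "__main__":
    for y in (100, 1000, 10000):
        print(y, float(ratio(y, 2, 2)), float(first_order(y, 2, 2)))
```

The error is of order $1/y^2$ for fixed $j$ and $k$, which is why the approximation is applied both with $y = n$ and with $y = n/d$ when $n/d$ is large.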


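Thus

$$Q_{jk} = E(U_\pi)_j\, E(U_\pi)_k\, \frac{(n)_{j+k}}{(n)_j (n)_k} \left\{ \frac{(n)_j (n)_k}{(n)_{j+k}} \left( \frac{1}{d}\,\frac{(n/d)_{j+k}}{(n/d)_j (n/d)_k} + 1 - \frac{1}{d} \right) \right\}^p \approx \left(1 - \frac{jk}{n}\right) E(U_\pi)_j\, E(U_\pi)_k. \tag{A.15}$$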
Hence

$$Q_{22} + 2Q_{12} + Q_{11} - \left(E U_\pi^2\right)^2 = Q_{22} - \{E(U_\pi)_2\}^2 + 2\{Q_{12} - E U_\pi\, E(U_\pi)_2\} + Q_{11} - (E U_\pi)^2 \approx -\frac{1}{n}\{2E(U_\pi)_2 + E U_\pi\}^2. \tag{A.16}$$

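Finally, by using $E(U_\pi)_k \approx n^k / d^{kp}$, we have

$$\mathrm{Var}(W) \approx -\frac{1}{n}\{2E(U_\pi)_2 + E U_\pi\}^2 + \frac{1}{d^p}\{4E(U_\pi)_3 + 6E(U_\pi)_2 + E U_\pi\}$$

$$\approx -\frac{1}{n}\left\{ \frac{2n^2}{d^{2p}} + \frac{n}{d^p} \right\}^2 + \frac{1}{d^p}\left\{ \frac{4n^3}{d^{3p}} + \frac{6n^2}{d^{2p}} + \frac{n}{d^p} \right\} = \frac{2n^2}{d^{3p}}. \tag{A.17}$$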
the data, and finally, studying the cell counts in the resulting contingency
table.  If  this  procedure  detects  structure,  we  can  then  use  more  computationally 
intensive  methods  to  determine  the  nature  of  this  structure. 